
Conversation

@jvsena42 (Member) commented Jan 28, 2026

Fixes #739

This PR prevents channel state divergence during node shutdown by ensuring state is fully persisted before the service is destroyed.

Description

When the app was stopped while a 0-conf channel had uncommitted state updates, the client (LDK) could end up with a different commitment height than the LSP. On reconnect, the LSP detected this mismatch as "possible data loss" and force-closed the channel.

This PR adds two mitigations (a rough sketch follows the list):

  1. Final sync before shutdown - Calls syncWallets() before stopping the node to ensure the latest channel state is persisted to VSS
  2. Blocking service shutdown - Uses runBlocking in onDestroy() to wait for the node to fully stop before the service is destroyed, with a 5-second timeout to avoid ANR
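
A minimal sketch of how these two mitigations could fit together in the service's onDestroy(). The identifiers lightningService, syncWallets(), and stop() are assumptions for illustration, not necessarily the names used in this PR; Logger and TAG follow the style of the snippets quoted later in this thread.

import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withTimeoutOrNull

// Sketch only, inside the foreground Service. Names below are assumed.
override fun onDestroy() {
    Logger.debug("onDestroy started", context = TAG)
    runBlocking {
        withTimeoutOrNull(5_000L) { // bounded wait to stay clear of an ANR
            runCatching {
                Logger.debug("Performing final sync before shutdown…", context = TAG)
                lightningService.syncWallets() // persist latest channel state to VSS
                Logger.debug("Final sync completed", context = TAG)
            }
            lightningService.stop() // wait for the node to fully stop
        } ?: Logger.warn("Shutdown did not finish within 5s", context = TAG)
    }
    Logger.debug("onDestroy completed", context = TAG)
    super.onDestroy()
}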

Preview

CJIT.webm
multiple-transactions-and-poor-signal.webm

QA Notes

1. Test graceful shutdown

  1. Open the app and create or use an existing Lightning channel
  2. Stop the app using the notification "Stop" button
  3. Check logcat for:
    • Performing final sync before shutdown…
    • Final sync completed
    • onDestroy started
    • onDestroy completed
  4. Verify the node stops without errors

2. Test restart after stop

  1. After stopping the app via notification, reopen it
  2. Verify the Lightning channel is still operational
  3. Verify no force-close occurred

3. Regression

  • Send and receive Lightning payments normally
  • Verify wallet balance updates correctly

@jvsena42 jvsena42 self-assigned this Jan 28, 2026
@jvsena42 jvsena42 marked this pull request as ready for review January 28, 2026 14:48
@jvsena42 jvsena42 requested a review from ovitrif January 28, 2026 14:48
@claude (bot): This comment has been minimized.

@jvsena42 jvsena42 linked an issue Jan 28, 2026 that may be closed by this pull request
@jvsena42 (Member, Author) commented:
Didn't find any problems in my tests, but the timing here is very difficult to reproduce.

@claude (bot): This comment has been minimized.

@ovitrif (Collaborator) left a comment:

nit: Probably still good to use non-main thread

Look at runBlocking(dispatcher) option

@jvsena42 jvsena42 marked this pull request as draft January 30, 2026 11:23
@jvsena42 (Member, Author) commented:
> nit: Probably still good to use non-main thread
>
> Look at runBlocking(dispatcher) option

The execution already switches to bgDispatcher because stop() is declared as suspend fun stop(): Result<Unit> = withContext(bgDispatcher), but the main thread is still blocked waiting for runBlocking to complete.
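
For illustration, a tiny self-contained example of that behaviour (placeholder names, not code from this PR): the suspend work hops to another dispatcher, but the thread that entered runBlocking stays parked until the block returns, and passing a dispatcher to runBlocking would not change that.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Stand-in for `suspend fun stop() = withContext(bgDispatcher) { ... }`
suspend fun stopLikeCall() = withContext(Dispatchers.IO) {
    delay(1_000) // pretend this is node.stop()
}

fun main() = runBlocking {
    println("entered runBlocking on: ${Thread.currentThread().name}") // main
    stopLikeCall() // body runs on an IO thread...
    println("resumed on: ${Thread.currentThread().name}") // ...but main waited here the whole time
}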

@jvsena42 (Member, Author) commented Feb 2, 2026:

todo: check iOS

@jvsena42 jvsena42 marked this pull request as ready for review February 2, 2026 13:22
@claude (bot) left a comment:

Code Review

I found one issue that should be addressed:

syncWallets() can consume the entire timeout budget, preventing node.stop() from executing

File: app/src/main/java/to/bitkit/services/LightningService.kt

The new syncWallets() call before shutdown shares the same 5-second timeout budget (from onDestroy) as the critical node.stop() call. Since syncWallets() is a network operation without its own timeout, it could take the full 5 seconds, leaving no time for node.stop() to execute.

This is problematic because:

  1. syncWallets() is a network call to VSS that can be slow or hang
  2. If it takes most/all of the 5 seconds, node.stop() never runs
  3. This leaves the node in a running but unmanaged state - worse than the original problem

In onTimeout(), the budget is only 2 seconds, making this even more likely to occur.

Suggested fix: Give syncWallets() its own sub-timeout (e.g., 2 seconds out of the 5-second budget) to guarantee node.stop() has time to execute:

runCatching {
    Logger.debug("Performing final sync before shutdown…", context = TAG)
    val synced = withTimeoutOrNull(2_000L) {
        ServiceQueue.LDK.background { node.syncWallets() }
    }
    if (synced != null) {
        Logger.debug("Final sync completed", context = TAG)
    } else {
        Logger.warn("Sync timed out, proceeding with shutdown", context = TAG)
    }
}.onFailure { Logger.warn("Final sync failed, proceeding with shutdown", it, context = TAG) }

This ensures node.stop() always has at least 3 seconds to execute properly.

@claude (bot) left a comment:

Code Review

I've reviewed this PR and found 2 significant issues that should be addressed before merging.

Comment on lines +407 to +411
// On cancellation (e.g., timeout), ensure state is recoverable
if (it is CancellationException) {
    Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
    _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
    return@withLock Result.failure(it)

Issue: Catching CancellationException without rethrowing breaks coroutine cancellation contract

This code catches CancellationException via runCatching and wraps it in Result.failure() without rethrowing. This violates Kotlin's coroutine cancellation contract, which requires CancellationException to always be rethrown for proper structured concurrency.

The same file demonstrates the correct pattern in multiple places:

  • Line 223: // Cancellation is expected during pull-to-refresh, rethrow per Kotlin best practices followed by if (it is CancellationException) throw it
  • Line 861: Same pattern

Impact: While this happens to work for the primary onDestroy call site (since withContext(bgDispatcher) may detect parent cancellation independently), stop() is also called from:

  • wipeStorage() (line 543)
  • restartWithElectrumServer() (line 562)
  • restartWithRgsServer() (line 588)
  • restartWithPreviousConfig() (line 609)
  • restartNode() (line 1076)

If any of these call sites' coroutines are cancelled while stop() is running, the CancellationException will be silently swallowed, and subsequent code will execute when it should not.

Suggested fix: Rethrow the CancellationException after performing the state recovery:

Suggested change

      // On cancellation (e.g., timeout), ensure state is recoverable
      if (it is CancellationException) {
          Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
          _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
-         return@withLock Result.failure(it)
+         throw it // Rethrow to properly propagate cancellation
      }

}.onFailure {
    // On cancellation (e.g., timeout), ensure state is recoverable
    if (it is CancellationException) {
        Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: Setting state to Stopped when JNI operations may still be running

When a timeout-induced CancellationException occurs, this code forces the state to NodeLifecycleState.Stopped. However, the native node.syncWallets() and node.stop() JNI calls may still be executing on the ServiceQueue.LDK single-threaded dispatcher. JNI calls are not interruptible by coroutine cancellation - they will continue running until completion.

Consequences:

  1. State machine corruption: The lifecycle state says Stopped but the node is still actively performing I/O, persisting state, etc.

  2. Mutex release allows concurrent operations: When return@withLock executes, the lifecycleMutex is released. If start() is called next (e.g., on app relaunch), it will:

    • Acquire the mutex
    • See state Stopped
    • Proceed with startup
    • Find lightningService.node is non-null (since line 259 in LightningService.kt hasn't executed yet)
    • Skip setup() and dispatch node.start() to the LDK queue
    • The still-running stop() block eventually completes and sets the service's node reference to null
    • Result: Repo thinks node is Running, but service's node reference is null
  3. Recovery mechanism doesn't help: The stuck-Stopping recovery at lines 282-286 is never triggered in the timeout scenario because state was already forced to Stopped.

Possible solutions:

  1. Leave state as Stopping instead of forcing to Stopped - this allows the stuck-state recovery mechanism to handle it on next start() (a rough sketch follows this list)
  2. Add a flag in LightningService to track whether native operations are truly complete, independent of coroutine cancellation
  3. Rethink the timeout approach - consider whether the timeout should apply to the entire operation or just serve as a best-effort mechanism with acknowledgment that native cleanup continues in background
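
A minimal sketch of what option 1 could look like inside the same onFailure block, combined with the rethrow suggested in the previous comment. The surrounding code and exact state names are assumed from the snippets quoted above, not a concrete patch.

if (it is CancellationException) {
    // Keep the state at Stopping so the existing stuck-Stopping recovery
    // can clean up on the next start(); do not force Stopped while the
    // native node.stop() may still be running on the LDK queue.
    Logger.warn("Node stop cancelled; leaving state as Stopping for recovery", context = TAG)
    throw it // preserve the coroutine cancellation contract
}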



Development

Successfully merging this pull request may close these issues.

CJIT channel force closed while attempting LNURL withdrawals

3 participants